An unsupervised approach to language identification
Authors
Abstract
This paper presents an unsupervised approach to Automatic Language Identification (ALI) based on vowel system modeling. Each language's vowel system is modeled by a Gaussian Mixture Model (GMM) trained on automatically detected vowels. Since this detection is unsupervised and language independent, no labeled data are required. GMMs are initialized using an efficient data-driven variant of the LBG algorithm: the LBG-Rissanen algorithm. With 5 languages from the OGI MLTS corpus and in a closed-set identification task, we reach 79% correct identification using only the vowel segments detected in 45-second utterances from male speakers.
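The closed-set decision described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes diagonal-covariance GMMs and a maximum-likelihood decision rule (the abstract does not state either), and the language names, the two-dimensional features, and all model parameters are made-up toy values. The function names `log_gauss_diag`, `gmm_log_likelihood`, and `identify_language` are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Summed log-likelihood of all frames under one GMM.

    gmm is a list of (weight, mean, variance) components.
    """
    total = 0.0
    for x in frames:
        # Log-sum-exp over the components, for numerical stability.
        terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

def identify_language(frames, models):
    """Closed-set decision: return the best-scoring language and all scores."""
    scores = {lang: gmm_log_likelihood(frames, gmm) for lang, gmm in models.items()}
    return max(scores, key=scores.get), scores

# Toy models (assumption): two components per language over 2-D features.
# A real system would train these on the detected vowel segments.
TOY_MODELS = {
    "lang_a": [(0.5, (300.0, 2300.0), (2500.0, 40000.0)),
               (0.5, (700.0, 1200.0), (2500.0, 40000.0))],
    "lang_b": [(0.5, (450.0, 1800.0), (2500.0, 40000.0)),
               (0.5, (550.0, 900.0), (2500.0, 40000.0))],
}

# Toy "vowel frames" from one utterance (assumption), lying near lang_a's means.
utterance = [(310.0, 2250.0), (690.0, 1250.0), (320.0, 2350.0)]
best, scores = identify_language(utterance, TOY_MODELS)
print(best)
```

Summing per-frame log-likelihoods treats the vowel frames as independent, which is the usual simplification when scoring an utterance against a GMM; in a closed-set task the utterance is always assigned to one of the known languages.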
Similar resources
Automatic Language Identification: An Alternative Unsupervised Approach Using a New Hybrid Algorithm
This paper deals with our research on unsupervised classification for automatic language identification. The study of this new hybrid algorithm shows that combining K-means with artificial ants, while taking advantage of an n-gram text representation, is promising. We propose an alternative approach to the standard use of both algorithms. A multilingual text corpus is used to...
Identification of Power Stripping Resources with Fuzzy Cluster Dynamic Approach (Case Study: West Azerbaijan Province)
Reducing electric power theft is a significant part of the potential benefits of implementing the concept of the smart grid. This paper proposes a data-based approach to identify locations with unusual electricity consumption. The new distance-based method classifies new data as violating customers if their distance from the primary consumption data is large. The proposed algorithm determines the ...
Comparison of two phonetic approaches to language identification
This paper presents two unsupervised approaches to Automatic Language Identification (ALI) based on a segmental preprocessing. In the Global Segmental Model approach, the language system is modeled by a Gaussian Mixture Model (GMM) trained with automatically detected segments. In the Phonetic Differentiated Model approach, an unsupervised vowel/non-vowel detection is performed and the language ...
Classification and identification of geological facies using seismic data and competitive neural networks
Geological facies interpretation is essential for reservoir studies. Classifying and identifying seismic traces is a powerful approach for geological facies classification and distinction. The use of neural networks as classifiers is increasing in different sciences, including seismology. They are computationally efficient and ideal for pattern identification. They can simply learn new algori...
Unsupervised language filtering using the latent dirichlet allocation
To automatically build from scratch the language processing component for a speech synthesis system in a new language, a purified text corpus is needed in which any words and phrases from other languages are clearly identified or excluded. When using found data, and where there is no inherent linguistic knowledge of the language/languages contained in the data, identifying the pure data is a diffic...